{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "355d651f",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Maximum Likelihood\n",
    ":label:`sec_maximum_likelihood`\n",
    "\n",
    "One of the most commonly encountered way of thinking in machine learning is the maximum likelihood point of view.  This is the concept that when working with a probabilistic model with unknown parameters, the parameters which make the data have the highest probability are the most likely ones.\n",
    "\n",
    "## The Maximum Likelihood Principle\n",
    "\n",
    "This has a Bayesian interpretation which can be helpful to think about.  Suppose that we have a model with parameters $\\boldsymbol{\\theta}$ and a collection of data examples $X$.  For concreteness, we can imagine that $\\boldsymbol{\\theta}$ is a single value representing the probability that a coin comes up heads when flipped, and $X$ is a sequence of independent coin flips.  We will look at this example in depth later.\n",
    "\n",
    "If we want to find the most likely value for the parameters of our model, that means we want to find\n",
    "\n",
    "$$\\mathop{\\mathrm{argmax}} P(\\boldsymbol{\\theta}\\mid X).$$\n",
    ":eqlabel:`eq_max_like`\n",
    "\n",
    "By Bayes' rule, this is the same thing as\n",
    "\n",
    "$$\n",
    "\\mathop{\\mathrm{argmax}} \\frac{P(X \\mid \\boldsymbol{\\theta})P(\\boldsymbol{\\theta})}{P(X)}.\n",
    "$$\n",
    "\n",
    "The expression $P(X)$, a parameter agnostic probability of generating the data, does not depend on $\\boldsymbol{\\theta}$ at all, and so can be dropped without changing the best choice of $\\boldsymbol{\\theta}$.  Similarly, we may now posit that we have no prior assumption on which set of parameters are better than any others, so we may declare that $P(\\boldsymbol{\\theta})$ does not depend on theta either!  This, for instance, makes sense in our coin flipping example where the probability it comes up heads could be any value in $[0,1]$ without any prior belief it is fair or not (often referred to as an *uninformative prior*).  Thus we see that our application of Bayes' rule shows that our best choice of $\\boldsymbol{\\theta}$ is the maximum likelihood estimate for $\\boldsymbol{\\theta}$:\n",
    "\n",
    "$$\n",
    "\\hat{\\boldsymbol{\\theta}} = \\mathop{\\mathrm{argmax}} _ {\\boldsymbol{\\theta}} P(X \\mid \\boldsymbol{\\theta}).\n",
    "$$\n",
    "\n",
    "As a matter of common terminology, the probability of the data given the parameters ($P(X \\mid \\boldsymbol{\\theta})$) is referred to as the *likelihood*.\n",
    "\n",
    "### A Concrete Example\n",
    "\n",
    "Let's see how this works in a concrete example.  Suppose that we have a single parameter $\\theta$ representing the probability that a coin flip is heads.  Then the probability of getting a tails is $1-\\theta$, and so if our observed data $X$ is a sequence with $n_H$ heads and $n_T$ tails, we can use the fact that independent probabilities multiply to see that \n",
    "\n",
    "$$\n",
    "P(X \\mid \\theta) = \\theta^{n_H}(1-\\theta)^{n_T}.\n",
    "$$\n",
    "\n",
    "If we flip $13$ coins and get the sequence \"HHHTHTTHHHHHT\", which has $n_H = 9$ and $n_T = 4$, we see that this is\n",
    "\n",
    "$$\n",
    "P(X \\mid \\theta) = \\theta^9(1-\\theta)^4.\n",
    "$$\n",
    "\n",
    "One nice thing about this example will be that we know the answer going in.  Indeed, if we said verbally, \"I flipped 13 coins, and 9 came up heads, what is our best guess for the probability that the coin comes us heads?, \" everyone would correctly guess $9/13$.  What this maximum likelihood method will give us is a way to get that number from first principals in a way that will generalize to vastly more complex situations.\n",
    "\n",
    "For our example, the plot of $P(X \\mid \\theta)$ is as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c49e0252",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:26:41.930794Z",
     "iopub.status.busy": "2023-08-18T19:26:41.930116Z",
     "iopub.status.idle": "2023-08-18T19:26:45.229245Z",
     "shell.execute_reply": "2023-08-18T19:26:45.228375Z"
    },
    "origin_pos": 2,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       "  \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<svg xmlns:xlink=\"http://www.w3.org/1999/xlink\" width=\"265.36875pt\" height=\"183.35625pt\" viewBox=\"0 0 265.36875 183.35625\" xmlns=\"http://www.w3.org/2000/svg\" version=\"1.1\">\n",
       " <metadata>\n",
       "  <rdf:RDF xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n",
       "   <cc:Work>\n",
       "    <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\n",
       "    <dc:date>2023-08-18T19:26:45.189460</dc:date>\n",
       "    <dc:format>image/svg+xml</dc:format>\n",
       "    <dc:creator>\n",
       "     <cc:Agent>\n",
       "      <dc:title>Matplotlib v3.7.2, https://matplotlib.org/</dc:title>\n",
       "     </cc:Agent>\n",
       "    </dc:creator>\n",
       "   </cc:Work>\n",
       "  </rdf:RDF>\n",
       " </metadata>\n",
       " <defs>\n",
       "  <style type=\"text/css\">*{stroke-linejoin: round; stroke-linecap: butt}</style>\n",
       " </defs>\n",
       " <g id=\"figure_1\">\n",
       "  <g id=\"patch_1\">\n",
       "   <path d=\"M 0 183.35625 \n",
       "L 265.36875 183.35625 \n",
       "L 265.36875 0 \n",
       "L 0 0 \n",
       "z\n",
       "\" style=\"fill: #ffffff\"/>\n",
       "  </g>\n",
       "  <g id=\"axes_1\">\n",
       "   <g id=\"patch_2\">\n",
       "    <path d=\"M 62.86875 145.8 \n",
       "L 258.16875 145.8 \n",
       "L 258.16875 7.2 \n",
       "L 62.86875 7.2 \n",
       "z\n",
       "\" style=\"fill: #ffffff\"/>\n",
       "   </g>\n",
       "   <g id=\"matplotlib.axis_1\">\n",
       "    <g id=\"xtick_1\">\n",
       "     <g id=\"line2d_1\">\n",
       "      <path d=\"M 71.746023 145.8 \n",
       "L 71.746023 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_2\">\n",
       "      <defs>\n",
       "       <path id=\"m72bf54c325\" d=\"M 0 0 \n",
       "L 0 3.5 \n",
       "\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </defs>\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"71.746023\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_1\">\n",
       "      <!-- 0.0 -->\n",
       "      <g transform=\"translate(63.79446 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-30\" d=\"M 2034 4250 \n",
       "Q 1547 4250 1301 3770 \n",
       "Q 1056 3291 1056 2328 \n",
       "Q 1056 1369 1301 889 \n",
       "Q 1547 409 2034 409 \n",
       "Q 2525 409 2770 889 \n",
       "Q 3016 1369 3016 2328 \n",
       "Q 3016 3291 2770 3770 \n",
       "Q 2525 4250 2034 4250 \n",
       "z\n",
       "M 2034 4750 \n",
       "Q 2819 4750 3233 4129 \n",
       "Q 3647 3509 3647 2328 \n",
       "Q 3647 1150 3233 529 \n",
       "Q 2819 -91 2034 -91 \n",
       "Q 1250 -91 836 529 \n",
       "Q 422 1150 422 2328 \n",
       "Q 422 3509 836 4129 \n",
       "Q 1250 4750 2034 4750 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "        <path id=\"DejaVuSans-2e\" d=\"M 684 794 \n",
       "L 1344 794 \n",
       "L 1344 0 \n",
       "L 684 0 \n",
       "L 684 794 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_2\">\n",
       "     <g id=\"line2d_3\">\n",
       "      <path d=\"M 107.290658 145.8 \n",
       "L 107.290658 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_4\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"107.290658\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_2\">\n",
       "      <!-- 0.2 -->\n",
       "      <g transform=\"translate(99.339095 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-32\" d=\"M 1228 531 \n",
       "L 3431 531 \n",
       "L 3431 0 \n",
       "L 469 0 \n",
       "L 469 531 \n",
       "Q 828 903 1448 1529 \n",
       "Q 2069 2156 2228 2338 \n",
       "Q 2531 2678 2651 2914 \n",
       "Q 2772 3150 2772 3378 \n",
       "Q 2772 3750 2511 3984 \n",
       "Q 2250 4219 1831 4219 \n",
       "Q 1534 4219 1204 4116 \n",
       "Q 875 4013 500 3803 \n",
       "L 500 4441 \n",
       "Q 881 4594 1212 4672 \n",
       "Q 1544 4750 1819 4750 \n",
       "Q 2544 4750 2975 4387 \n",
       "Q 3406 4025 3406 3419 \n",
       "Q 3406 3131 3298 2873 \n",
       "Q 3191 2616 2906 2266 \n",
       "Q 2828 2175 2409 1742 \n",
       "Q 1991 1309 1228 531 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-32\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_3\">\n",
       "     <g id=\"line2d_5\">\n",
       "      <path d=\"M 142.835293 145.8 \n",
       "L 142.835293 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_6\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"142.835293\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_3\">\n",
       "      <!-- 0.4 -->\n",
       "      <g transform=\"translate(134.88373 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-34\" d=\"M 2419 4116 \n",
       "L 825 1625 \n",
       "L 2419 1625 \n",
       "L 2419 4116 \n",
       "z\n",
       "M 2253 4666 \n",
       "L 3047 4666 \n",
       "L 3047 1625 \n",
       "L 3713 1625 \n",
       "L 3713 1100 \n",
       "L 3047 1100 \n",
       "L 3047 0 \n",
       "L 2419 0 \n",
       "L 2419 1100 \n",
       "L 313 1100 \n",
       "L 313 1709 \n",
       "L 2253 4666 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-34\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_4\">\n",
       "     <g id=\"line2d_7\">\n",
       "      <path d=\"M 178.379928 145.8 \n",
       "L 178.379928 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_8\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"178.379928\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_4\">\n",
       "      <!-- 0.6 -->\n",
       "      <g transform=\"translate(170.428365 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-36\" d=\"M 2113 2584 \n",
       "Q 1688 2584 1439 2293 \n",
       "Q 1191 2003 1191 1497 \n",
       "Q 1191 994 1439 701 \n",
       "Q 1688 409 2113 409 \n",
       "Q 2538 409 2786 701 \n",
       "Q 3034 994 3034 1497 \n",
       "Q 3034 2003 2786 2293 \n",
       "Q 2538 2584 2113 2584 \n",
       "z\n",
       "M 3366 4563 \n",
       "L 3366 3988 \n",
       "Q 3128 4100 2886 4159 \n",
       "Q 2644 4219 2406 4219 \n",
       "Q 1781 4219 1451 3797 \n",
       "Q 1122 3375 1075 2522 \n",
       "Q 1259 2794 1537 2939 \n",
       "Q 1816 3084 2150 3084 \n",
       "Q 2853 3084 3261 2657 \n",
       "Q 3669 2231 3669 1497 \n",
       "Q 3669 778 3244 343 \n",
       "Q 2819 -91 2113 -91 \n",
       "Q 1303 -91 875 529 \n",
       "Q 447 1150 447 2328 \n",
       "Q 447 3434 972 4092 \n",
       "Q 1497 4750 2381 4750 \n",
       "Q 2619 4750 2861 4703 \n",
       "Q 3103 4656 3366 4563 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-36\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_5\">\n",
       "     <g id=\"line2d_9\">\n",
       "      <path d=\"M 213.924563 145.8 \n",
       "L 213.924563 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_10\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"213.924563\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_5\">\n",
       "      <!-- 0.8 -->\n",
       "      <g transform=\"translate(205.973001 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-38\" d=\"M 2034 2216 \n",
       "Q 1584 2216 1326 1975 \n",
       "Q 1069 1734 1069 1313 \n",
       "Q 1069 891 1326 650 \n",
       "Q 1584 409 2034 409 \n",
       "Q 2484 409 2743 651 \n",
       "Q 3003 894 3003 1313 \n",
       "Q 3003 1734 2745 1975 \n",
       "Q 2488 2216 2034 2216 \n",
       "z\n",
       "M 1403 2484 \n",
       "Q 997 2584 770 2862 \n",
       "Q 544 3141 544 3541 \n",
       "Q 544 4100 942 4425 \n",
       "Q 1341 4750 2034 4750 \n",
       "Q 2731 4750 3128 4425 \n",
       "Q 3525 4100 3525 3541 \n",
       "Q 3525 3141 3298 2862 \n",
       "Q 3072 2584 2669 2484 \n",
       "Q 3125 2378 3379 2068 \n",
       "Q 3634 1759 3634 1313 \n",
       "Q 3634 634 3220 271 \n",
       "Q 2806 -91 2034 -91 \n",
       "Q 1263 -91 848 271 \n",
       "Q 434 634 434 1313 \n",
       "Q 434 1759 690 2068 \n",
       "Q 947 2378 1403 2484 \n",
       "z\n",
       "M 1172 3481 \n",
       "Q 1172 3119 1398 2916 \n",
       "Q 1625 2713 2034 2713 \n",
       "Q 2441 2713 2670 2916 \n",
       "Q 2900 3119 2900 3481 \n",
       "Q 2900 3844 2670 4047 \n",
       "Q 2441 4250 2034 4250 \n",
       "Q 1625 4250 1398 4047 \n",
       "Q 1172 3844 1172 3481 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-38\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_6\">\n",
       "     <g id=\"line2d_11\">\n",
       "      <path d=\"M 249.469198 145.8 \n",
       "L 249.469198 7.2 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_12\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m72bf54c325\" x=\"249.469198\" y=\"145.8\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_6\">\n",
       "      <!-- 1.0 -->\n",
       "      <g transform=\"translate(241.517636 160.398438) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-31\" d=\"M 794 531 \n",
       "L 1825 531 \n",
       "L 1825 4091 \n",
       "L 703 3866 \n",
       "L 703 4441 \n",
       "L 1819 4666 \n",
       "L 2450 4666 \n",
       "L 2450 531 \n",
       "L 3481 531 \n",
       "L 3481 0 \n",
       "L 794 0 \n",
       "L 794 531 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-31\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"text_7\">\n",
       "     <!-- theta -->\n",
       "     <g transform=\"translate(147.289062 174.076563) scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-74\" d=\"M 1172 4494 \n",
       "L 1172 3500 \n",
       "L 2356 3500 \n",
       "L 2356 3053 \n",
       "L 1172 3053 \n",
       "L 1172 1153 \n",
       "Q 1172 725 1289 603 \n",
       "Q 1406 481 1766 481 \n",
       "L 2356 481 \n",
       "L 2356 0 \n",
       "L 1766 0 \n",
       "Q 1100 0 847 248 \n",
       "Q 594 497 594 1153 \n",
       "L 594 3053 \n",
       "L 172 3053 \n",
       "L 172 3500 \n",
       "L 594 3500 \n",
       "L 594 4494 \n",
       "L 1172 4494 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-68\" d=\"M 3513 2113 \n",
       "L 3513 0 \n",
       "L 2938 0 \n",
       "L 2938 2094 \n",
       "Q 2938 2591 2744 2837 \n",
       "Q 2550 3084 2163 3084 \n",
       "Q 1697 3084 1428 2787 \n",
       "Q 1159 2491 1159 1978 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 4863 \n",
       "L 1159 4863 \n",
       "L 1159 2956 \n",
       "Q 1366 3272 1645 3428 \n",
       "Q 1925 3584 2291 3584 \n",
       "Q 2894 3584 3203 3211 \n",
       "Q 3513 2838 3513 2113 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-65\" d=\"M 3597 1894 \n",
       "L 3597 1613 \n",
       "L 953 1613 \n",
       "Q 991 1019 1311 708 \n",
       "Q 1631 397 2203 397 \n",
       "Q 2534 397 2845 478 \n",
       "Q 3156 559 3463 722 \n",
       "L 3463 178 \n",
       "Q 3153 47 2828 -22 \n",
       "Q 2503 -91 2169 -91 \n",
       "Q 1331 -91 842 396 \n",
       "Q 353 884 353 1716 \n",
       "Q 353 2575 817 3079 \n",
       "Q 1281 3584 2069 3584 \n",
       "Q 2775 3584 3186 3129 \n",
       "Q 3597 2675 3597 1894 \n",
       "z\n",
       "M 3022 2063 \n",
       "Q 3016 2534 2758 2815 \n",
       "Q 2500 3097 2075 3097 \n",
       "Q 1594 3097 1305 2825 \n",
       "Q 1016 2553 972 2059 \n",
       "L 3022 2063 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-61\" d=\"M 2194 1759 \n",
       "Q 1497 1759 1228 1600 \n",
       "Q 959 1441 959 1056 \n",
       "Q 959 750 1161 570 \n",
       "Q 1363 391 1709 391 \n",
       "Q 2188 391 2477 730 \n",
       "Q 2766 1069 2766 1631 \n",
       "L 2766 1759 \n",
       "L 2194 1759 \n",
       "z\n",
       "M 3341 1997 \n",
       "L 3341 0 \n",
       "L 2766 0 \n",
       "L 2766 531 \n",
       "Q 2569 213 2275 61 \n",
       "Q 1981 -91 1556 -91 \n",
       "Q 1019 -91 701 211 \n",
       "Q 384 513 384 1019 \n",
       "Q 384 1609 779 1909 \n",
       "Q 1175 2209 1959 2209 \n",
       "L 2766 2209 \n",
       "L 2766 2266 \n",
       "Q 2766 2663 2505 2880 \n",
       "Q 2244 3097 1772 3097 \n",
       "Q 1472 3097 1187 3025 \n",
       "Q 903 2953 641 2809 \n",
       "L 641 3341 \n",
       "Q 956 3463 1253 3523 \n",
       "Q 1550 3584 1831 3584 \n",
       "Q 2591 3584 2966 3190 \n",
       "Q 3341 2797 3341 1997 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-74\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-68\" x=\"39.208984\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"102.587891\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-74\" x=\"164.111328\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-61\" x=\"203.320312\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "   </g>\n",
       "   <g id=\"matplotlib.axis_2\">\n",
       "    <g id=\"ytick_1\">\n",
       "     <g id=\"line2d_13\">\n",
       "      <path d=\"M 62.86875 139.5 \n",
       "L 258.16875 139.5 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_14\">\n",
       "      <defs>\n",
       "       <path id=\"mf8e41c6aa3\" d=\"M 0 0 \n",
       "L -3.5 0 \n",
       "\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </defs>\n",
       "      <g>\n",
       "       <use xlink:href=\"#mf8e41c6aa3\" x=\"62.86875\" y=\"139.5\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_8\">\n",
       "      <!-- 0.0000 -->\n",
       "      <g transform=\"translate(20.878125 143.299219) scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"159.033203\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"222.65625\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"286.279297\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_2\">\n",
       "     <g id=\"line2d_15\">\n",
       "      <path d=\"M 62.86875 101.021967 \n",
       "L 258.16875 101.021967 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_16\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mf8e41c6aa3\" x=\"62.86875\" y=\"101.021967\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_9\">\n",
       "      <!-- 0.0001 -->\n",
       "      <g transform=\"translate(20.878125 104.821186) scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"159.033203\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"222.65625\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-31\" x=\"286.279297\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_3\">\n",
       "     <g id=\"line2d_17\">\n",
       "      <path d=\"M 62.86875 62.543934 \n",
       "L 258.16875 62.543934 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_18\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mf8e41c6aa3\" x=\"62.86875\" y=\"62.543934\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_10\">\n",
       "      <!-- 0.0002 -->\n",
       "      <g transform=\"translate(20.878125 66.343153) scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"159.033203\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"222.65625\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-32\" x=\"286.279297\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_4\">\n",
       "     <g id=\"line2d_19\">\n",
       "      <path d=\"M 62.86875 24.065901 \n",
       "L 258.16875 24.065901 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #b0b0b0; stroke-width: 0.8; stroke-linecap: square\"/>\n",
       "     </g>\n",
       "     <g id=\"line2d_20\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mf8e41c6aa3\" x=\"62.86875\" y=\"24.065901\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_11\">\n",
       "      <!-- 0.0003 -->\n",
       "      <g transform=\"translate(20.878125 27.865119) scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-33\" d=\"M 2597 2516 \n",
       "Q 3050 2419 3304 2112 \n",
       "Q 3559 1806 3559 1356 \n",
       "Q 3559 666 3084 287 \n",
       "Q 2609 -91 1734 -91 \n",
       "Q 1441 -91 1130 -33 \n",
       "Q 819 25 488 141 \n",
       "L 488 750 \n",
       "Q 750 597 1062 519 \n",
       "Q 1375 441 1716 441 \n",
       "Q 2309 441 2620 675 \n",
       "Q 2931 909 2931 1356 \n",
       "Q 2931 1769 2642 2001 \n",
       "Q 2353 2234 1838 2234 \n",
       "L 1294 2234 \n",
       "L 1294 2753 \n",
       "L 1863 2753 \n",
       "Q 2328 2753 2575 2939 \n",
       "Q 2822 3125 2822 3475 \n",
       "Q 2822 3834 2567 4026 \n",
       "Q 2313 4219 1838 4219 \n",
       "Q 1578 4219 1281 4162 \n",
       "Q 984 4106 628 3988 \n",
       "L 628 4550 \n",
       "Q 988 4650 1302 4700 \n",
       "Q 1616 4750 1894 4750 \n",
       "Q 2613 4750 3031 4423 \n",
       "Q 3450 4097 3450 3541 \n",
       "Q 3450 3153 3228 2886 \n",
       "Q 3006 2619 2597 2516 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-2e\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"95.410156\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"159.033203\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"222.65625\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-33\" x=\"286.279297\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"text_12\">\n",
       "     <!-- likelihood -->\n",
       "     <g transform=\"translate(14.798437 100.308594) rotate(-90) scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-6c\" d=\"M 603 4863 \n",
       "L 1178 4863 \n",
       "L 1178 0 \n",
       "L 603 0 \n",
       "L 603 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-69\" d=\"M 603 3500 \n",
       "L 1178 3500 \n",
       "L 1178 0 \n",
       "L 603 0 \n",
       "L 603 3500 \n",
       "z\n",
       "M 603 4863 \n",
       "L 1178 4863 \n",
       "L 1178 4134 \n",
       "L 603 4134 \n",
       "L 603 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6b\" d=\"M 581 4863 \n",
       "L 1159 4863 \n",
       "L 1159 1991 \n",
       "L 2875 3500 \n",
       "L 3609 3500 \n",
       "L 1753 1863 \n",
       "L 3688 0 \n",
       "L 2938 0 \n",
       "L 1159 1709 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6f\" d=\"M 1959 3097 \n",
       "Q 1497 3097 1228 2736 \n",
       "Q 959 2375 959 1747 \n",
       "Q 959 1119 1226 758 \n",
       "Q 1494 397 1959 397 \n",
       "Q 2419 397 2687 759 \n",
       "Q 2956 1122 2956 1747 \n",
       "Q 2956 2369 2687 2733 \n",
       "Q 2419 3097 1959 3097 \n",
       "z\n",
       "M 1959 3584 \n",
       "Q 2709 3584 3137 3096 \n",
       "Q 3566 2609 3566 1747 \n",
       "Q 3566 888 3137 398 \n",
       "Q 2709 -91 1959 -91 \n",
       "Q 1206 -91 779 398 \n",
       "Q 353 888 353 1747 \n",
       "Q 353 2609 779 3096 \n",
       "Q 1206 3584 1959 3584 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-64\" d=\"M 2906 2969 \n",
       "L 2906 4863 \n",
       "L 3481 4863 \n",
       "L 3481 0 \n",
       "L 2906 0 \n",
       "L 2906 525 \n",
       "Q 2725 213 2448 61 \n",
       "Q 2172 -91 1784 -91 \n",
       "Q 1150 -91 751 415 \n",
       "Q 353 922 353 1747 \n",
       "Q 353 2572 751 3078 \n",
       "Q 1150 3584 1784 3584 \n",
       "Q 2172 3584 2448 3432 \n",
       "Q 2725 3281 2906 2969 \n",
       "z\n",
       "M 947 1747 \n",
       "Q 947 1113 1208 752 \n",
       "Q 1469 391 1925 391 \n",
       "Q 2381 391 2643 752 \n",
       "Q 2906 1113 2906 1747 \n",
       "Q 2906 2381 2643 2742 \n",
       "Q 2381 3103 1925 3103 \n",
       "Q 1469 3103 1208 2742 \n",
       "Q 947 2381 947 1747 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-6c\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-69\" x=\"27.783203\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6b\" x=\"55.566406\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"109.851562\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6c\" x=\"171.375\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-69\" x=\"199.158203\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-68\" x=\"226.941406\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6f\" x=\"290.320312\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6f\" x=\"351.501953\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-64\" x=\"412.683594\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "   </g>\n",
       "   <g id=\"line2d_21\">\n",
       "    <path d=\"M 71.746023 139.5 \n",
       "L 108.712444 139.389673 \n",
       "L 114.755031 139.1384 \n",
       "L 119.020388 138.755385 \n",
       "L 122.574853 138.219734 \n",
       "L 125.596145 137.545086 \n",
       "L 128.261996 136.731818 \n",
       "L 130.75012 135.745418 \n",
       "L 133.060518 134.596007 \n",
       "L 135.370921 133.186063 \n",
       "L 137.503598 131.62246 \n",
       "L 139.636276 129.778414 \n",
       "L 141.768953 127.626759 \n",
       "L 143.901635 125.14203 \n",
       "L 146.034312 122.301462 \n",
       "L 148.166989 119.085984 \n",
       "L 150.299667 115.481271 \n",
       "L 152.61007 111.127251 \n",
       "L 154.920468 106.30505 \n",
       "L 157.408592 100.599501 \n",
       "L 160.074437 93.928618 \n",
       "L 163.095735 85.744115 \n",
       "L 166.650195 75.433979 \n",
       "L 172.337337 58.119634 \n",
       "L 177.313586 43.251283 \n",
       "L 180.157152 35.402634 \n",
       "L 182.467566 29.606819 \n",
       "L 184.600248 24.856907 \n",
       "L 186.377467 21.423444 \n",
       "L 187.976977 18.79731 \n",
       "L 189.398765 16.868634 \n",
       "L 190.642822 15.517648 \n",
       "L 191.886889 14.499093 \n",
       "L 192.953225 13.902578 \n",
       "L 194.019572 13.570585 \n",
       "L 194.908187 13.501467 \n",
       "L 195.796791 13.62501 \n",
       "L 196.685417 13.944259 \n",
       "L 197.751753 14.588928 \n",
       "L 198.818099 15.521422 \n",
       "L 199.884435 16.742773 \n",
       "L 201.128492 18.531892 \n",
       "L 202.372559 20.709311 \n",
       "L 203.794347 23.663709 \n",
       "L 205.393846 27.561309 \n",
       "L 207.171076 32.566078 \n",
       "L 209.126038 38.822669 \n",
       "L 211.25871 46.433427 \n",
       "L 213.746844 56.147614 \n",
       "L 217.301304 71.032847 \n",
       "L 223.699341 97.968294 \n",
       "L 226.365186 108.119528 \n",
       "L 228.497858 115.403251 \n",
       "L 230.45282 121.285762 \n",
       "L 232.23006 125.901569 \n",
       "L 233.829559 129.428433 \n",
       "L 235.251347 132.058457 \n",
       "L 236.673125 134.22193 \n",
       "L 237.917192 135.748796 \n",
       "L 239.161249 136.956781 \n",
       "L 240.405316 137.874865 \n",
       "L 241.649384 138.53807 \n",
       "L 243.071161 139.03536 \n",
       "L 244.670671 139.340161 \n",
       "L 246.803353 139.482998 \n",
       "L 249.291477 139.5 \n",
       "L 249.291477 139.5 \n",
       "\" clip-path=\"url(#p86d50c39af)\" style=\"fill: none; stroke: #1f77b4; stroke-width: 1.5; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_3\">\n",
       "    <path d=\"M 62.86875 145.8 \n",
       "L 62.86875 7.2 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_4\">\n",
       "    <path d=\"M 258.16875 145.8 \n",
       "L 258.16875 7.2 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_5\">\n",
       "    <path d=\"M 62.86875 145.8 \n",
       "L 258.16875 145.8 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_6\">\n",
       "    <path d=\"M 62.86875 7.2 \n",
       "L 258.16875 7.2 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "  </g>\n",
       " </g>\n",
       " <defs>\n",
       "  <clipPath id=\"p86d50c39af\">\n",
       "   <rect x=\"62.86875\" y=\"7.2\" width=\"195.3\" height=\"138.6\"/>\n",
       "  </clipPath>\n",
       " </defs>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<Figure size 350x250 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import torch\n",
    "from d2l import torch as d2l\n",
    "\n",
    "theta = torch.arange(0, 1, 0.001)\n",
    "p = theta**9 * (1 - theta)**4.\n",
    "\n",
    "d2l.plot(theta, p, 'theta', 'likelihood')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c7f574b",
   "metadata": {
    "origin_pos": 4
   },
   "source": [
    "This has its maximum value somewhere near our expected $9/13 \\approx 0.7\\ldots$.  To see if it is exactly there, we can turn to calculus.  Notice that at the maximum, the gradient of the function is flat.  Thus, we could find the maximum likelihood estimate :eqref:`eq_max_like` by finding the values of $\\theta$ where the derivative is zero, and finding the one that gives the highest probability.  We compute:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "0 & = \\frac{d}{d\\theta} P(X \\mid \\theta) \\\\\n",
    "& = \\frac{d}{d\\theta} \\theta^9(1-\\theta)^4 \\\\\n",
    "& = 9\\theta^8(1-\\theta)^4 - 4\\theta^9(1-\\theta)^3 \\\\\n",
    "& = \\theta^8(1-\\theta)^3(9-13\\theta).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "This has three solutions: $0$, $1$ and $9/13$.  The first two are clearly minima, not maxima as they assign probability $0$ to our sequence.  The final value does *not* assign zero probability to our sequence, and thus must be the maximum likelihood estimate $\\hat \\theta = 9/13$.\n",
    "\n",
    "## Numerical Optimization and the Negative Log-Likelihood\n",
    "\n",
    "The previous example is nice, but what if we have billions of parameters and data examples?\n",
    "\n",
    "First, notice that if we make the assumption that all the data examples are independent, we can no longer practically consider the likelihood itself as it is a product of many probabilities.  Indeed, each probability is in $[0,1]$, say typically of value about $1/2$, and the product of $(1/2)^{1000000000}$ is far below machine precision.  We cannot work with that directly.  \n",
    "\n",
    "However, recall that the logarithm turns products to sums, in which case \n",
    "\n",
    "$$\n",
    "\\log((1/2)^{1000000000}) = 1000000000\\cdot\\log(1/2) \\approx -301029995.6\\ldots\n",
    "$$\n",
    "\n",
    "This number fits perfectly within even a single precision $32$-bit float.  Thus, we should consider the *log-likelihood*, which is\n",
    "\n",
    "$$\n",
    "\\log(P(X \\mid \\boldsymbol{\\theta})).\n",
    "$$\n",
    "\n",
    "Since the function $x \\mapsto \\log(x)$ is increasing, maximizing the likelihood is the same thing as maximizing the log-likelihood.  Indeed in :numref:`sec_naive_bayes` we will see this reasoning applied when working with the specific example of the naive Bayes classifier.\n",
    "\n",
    "We often work with loss functions, where we wish to minimize the loss.  We may turn maximum likelihood into the minimization of a loss by taking $-\\log(P(X \\mid \\boldsymbol{\\theta}))$, which is the *negative log-likelihood*.\n",
    "\n",
    "To illustrate this, consider the coin flipping problem from before, and pretend that we do not know the closed form solution.  We may compute that\n",
    "\n",
    "$$\n",
    "-\\log(P(X \\mid \\boldsymbol{\\theta})) = -\\log(\\theta^{n_H}(1-\\theta)^{n_T}) = -(n_H\\log(\\theta) + n_T\\log(1-\\theta)).\n",
    "$$\n",
    "\n",
    "This can be written into code, and freely optimized even for billions of coin flips.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "87cb6fe2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:26:45.233274Z",
     "iopub.status.busy": "2023-08-18T19:26:45.232595Z",
     "iopub.status.idle": "2023-08-18T19:26:45.322872Z",
     "shell.execute_reply": "2023-08-18T19:26:45.322023Z"
    },
    "origin_pos": 6,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor(0.9713, requires_grad=True), 0.9713101437890875)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Set up our data\n",
    "n_H = 8675309\n",
    "n_T = 256245\n",
    "\n",
    "# Initialize our paramteres\n",
    "theta = torch.tensor(0.5, requires_grad=True)\n",
    "\n",
    "# Perform gradient descent\n",
    "lr = 1e-9\n",
    "for iter in range(100):\n",
    "    loss = -(n_H * torch.log(theta) + n_T * torch.log(1 - theta))\n",
    "    loss.backward()\n",
    "    with torch.no_grad():\n",
    "        theta -= lr * theta.grad\n",
    "    theta.grad.zero_()\n",
    "\n",
    "# Check output\n",
    "theta, n_H / (n_H + n_T)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9d29b67",
   "metadata": {
    "origin_pos": 8
   },
   "source": [
    "Numerical convenience is not the only reason why people like to use negative log-likelihoods. There are several other reasons why it is preferable.\n",
    "\n",
    "\n",
    "\n",
    "The second reason we consider the log-likelihood is the simplified application of calculus rules. As discussed above, due to independence assumptions, most probabilities we encounter in machine learning are products of individual probabilities.\n",
    "\n",
    "$$\n",
    "P(X\\mid\\boldsymbol{\\theta}) = p(x_1\\mid\\boldsymbol{\\theta})\\cdot p(x_2\\mid\\boldsymbol{\\theta})\\cdots p(x_n\\mid\\boldsymbol{\\theta}).\n",
    "$$\n",
    "\n",
    "This means that if we directly apply the product rule to compute a derivative we get\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\frac{\\partial}{\\partial \\boldsymbol{\\theta}} P(X\\mid\\boldsymbol{\\theta}) & = \\left(\\frac{\\partial}{\\partial \\boldsymbol{\\theta}}P(x_1\\mid\\boldsymbol{\\theta})\\right)\\cdot P(x_2\\mid\\boldsymbol{\\theta})\\cdots P(x_n\\mid\\boldsymbol{\\theta}) \\\\\n",
    "& \\quad + P(x_1\\mid\\boldsymbol{\\theta})\\cdot \\left(\\frac{\\partial}{\\partial \\boldsymbol{\\theta}}P(x_2\\mid\\boldsymbol{\\theta})\\right)\\cdots P(x_n\\mid\\boldsymbol{\\theta}) \\\\\n",
    "& \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\vdots \\\\\n",
    "& \\quad + P(x_1\\mid\\boldsymbol{\\theta})\\cdot P(x_2\\mid\\boldsymbol{\\theta}) \\cdots \\left(\\frac{\\partial}{\\partial \\boldsymbol{\\theta}}P(x_n\\mid\\boldsymbol{\\theta})\\right).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "This requires $n(n-1)$ multiplications, along with $(n-1)$ additions, so it is proportional to quadratic time in the inputs!  Sufficient cleverness in grouping terms will reduce this to linear time, but it requires some thought.  For the negative log-likelihood we have instead\n",
    "\n",
    "$$\n",
    "-\\log\\left(P(X\\mid\\boldsymbol{\\theta})\\right) = -\\log(P(x_1\\mid\\boldsymbol{\\theta})) - \\log(P(x_2\\mid\\boldsymbol{\\theta})) \\cdots - \\log(P(x_n\\mid\\boldsymbol{\\theta})),\n",
    "$$\n",
    "\n",
    "which then gives\n",
    "\n",
    "$$\n",
    "- \\frac{\\partial}{\\partial \\boldsymbol{\\theta}} \\log\\left(P(X\\mid\\boldsymbol{\\theta})\\right) = \\frac{1}{P(x_1\\mid\\boldsymbol{\\theta})}\\left(\\frac{\\partial}{\\partial \\boldsymbol{\\theta}}P(x_1\\mid\\boldsymbol{\\theta})\\right) + \\cdots + \\frac{1}{P(x_n\\mid\\boldsymbol{\\theta})}\\left(\\frac{\\partial}{\\partial \\boldsymbol{\\theta}}P(x_n\\mid\\boldsymbol{\\theta})\\right).\n",
    "$$\n",
    "\n",
    "This requires only $n$ divides and $n-1$ sums, and thus is linear time in the inputs.\n",
    "\n",
    "The third and final reason to consider the negative log-likelihood is the relationship to information theory, which we will discuss in detail in :numref:`sec_information_theory`.  This is a rigorous mathematical theory which gives a way to measure the degree of information or randomness in a random variable.  The key object of study in that field is the entropy which is \n",
    "\n",
    "$$\n",
    "H(p) = -\\sum_{i} p_i \\log_2(p_i),\n",
    "$$\n",
    "\n",
    "which measures the randomness of a source. Notice that this is nothing more than the average $-\\log$ probability, and thus if we take our negative log-likelihood and divide by the number of data examples, we get a relative of entropy known as cross-entropy.  This theoretical interpretation alone would be sufficiently compelling to motivate reporting the average negative log-likelihood over the dataset as a way of measuring model performance.\n",
    "\n",
    "## Maximum Likelihood for Continuous Variables\n",
    "\n",
    "Everything that we have done so far assumes we are working with discrete random variables, but what if we want to work with continuous ones?\n",
    "\n",
    "The short summary is that nothing at all changes, except we replace all the instances of the probability with the probability density.  Recalling that we write densities with lower case $p$, this means that for example we now say\n",
    "\n",
    "$$\n",
    "-\\log\\left(p(X\\mid\\boldsymbol{\\theta})\\right) = -\\log(p(x_1\\mid\\boldsymbol{\\theta})) - \\log(p(x_2\\mid\\boldsymbol{\\theta})) \\cdots - \\log(p(x_n\\mid\\boldsymbol{\\theta})) = -\\sum_i \\log(p(x_i \\mid \\theta)).\n",
    "$$\n",
    "\n",
    "The question becomes, \"Why is this OK?\"  After all, the reason we introduced densities was because probabilities of getting specific outcomes themselves was zero, and thus is not the probability of generating our data for any set of parameters zero?\n",
    "\n",
    "Indeed, this is the case, and understanding why we can shift to densities is an exercise in tracing what happens to the epsilons.\n",
    "\n",
    "Let's first re-define our goal.  Suppose that for continuous random variables we no longer want to compute the probability of getting exactly the right value, but instead matching to within some range $\\epsilon$.  For simplicity, we assume our data is repeated observations $x_1, \\ldots, x_N$ of identically distributed random variables $X_1, \\ldots, X_N$.  As we have seen previously, this can be written as\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "&P(X_1 \\in [x_1, x_1+\\epsilon], X_2 \\in [x_2, x_2+\\epsilon], \\ldots, X_N \\in [x_N, x_N+\\epsilon]\\mid\\boldsymbol{\\theta}) \\\\\n",
    "\\approx &\\epsilon^Np(x_1\\mid\\boldsymbol{\\theta})\\cdot p(x_2\\mid\\boldsymbol{\\theta}) \\cdots p(x_n\\mid\\boldsymbol{\\theta}).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "Thus, if we take negative logarithms of this we obtain\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "&-\\log(P(X_1 \\in [x_1, x_1+\\epsilon], X_2 \\in [x_2, x_2+\\epsilon], \\ldots, X_N \\in [x_N, x_N+\\epsilon]\\mid\\boldsymbol{\\theta})) \\\\\n",
    "\\approx & -N\\log(\\epsilon) - \\sum_{i} \\log(p(x_i\\mid\\boldsymbol{\\theta})).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "If we examine this expression, the only place that the $\\epsilon$ occurs is in the additive constant $-N\\log(\\epsilon)$.  This does not depend on the parameters $\\boldsymbol{\\theta}$ at all, so the optimal choice of $\\boldsymbol{\\theta}$ does not depend on our choice of $\\epsilon$!  If we demand four digits or four-hundred, the best choice of $\\boldsymbol{\\theta}$ remains the same, thus we may freely drop the epsilon to see that what we want to optimize is\n",
    "\n",
    "$$\n",
    "- \\sum_{i} \\log(p(x_i\\mid\\boldsymbol{\\theta})).\n",
    "$$\n",
    "\n",
    "Thus, we see that the maximum likelihood point of view can operate with continuous random variables as easily as with discrete ones by replacing the probabilities with probability densities.\n",
    "\n",
    "## Summary\n",
    "* The maximum likelihood principle tells us that the best fit model for a given dataset is the one that generates the data with the highest probability.\n",
    "* Often people work with the negative log-likelihood instead for a variety of reasons: numerical stability, conversion of products to sums (and the resulting simplification of gradient computations), and theoretical ties to information theory.\n",
    "* While simplest to motivate in the discrete setting, it may be freely generalized to the continuous setting as well by maximizing the probability density assigned to the datapoints.\n",
    "\n",
    "## Exercises\n",
    "1. Suppose that you know that a non-negative random variable has density $\\alpha e^{-\\alpha x}$ for some value $\\alpha>0$.  You obtain a single observation from the random variable which is the number $3$.  What is the maximum likelihood estimate for $\\alpha$?\n",
    "2. Suppose that you have a dataset of samples $\\{x_i\\}_{i=1}^N$ drawn from a Gaussian with unknown mean, but variance $1$.  What is the maximum likelihood estimate for the mean?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25da6457",
   "metadata": {
    "origin_pos": 10,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/1096)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}